The final project Abschlussprojekt consists of two parts:
Part 1 - Data analysis
Part 2 - Dashboard
You work as a data scientist at the P-2-P platform https://www.kiva.org/, which was founded a year ago. Now you want to expand your business. Your team has split up and each analyst has a sub-area of the data. Your task is to find insights for your platform in an explorative data analysis.
Requirements:
- Complete data preparation
- Documentation of the work steps
- Comprehensible and meaningful explanations of procedure/decisions
- for the evaluation: name 3 different types of plots (plot functions)
- Customizing of the plots
- Analysis of the plots
- coding
Your business model is to operate a platform (crowd-investing) where people who have a business idea but not the money needed can register and raise money for their project within a specified period of time.
On the other hand, you have investors who would like to invest their money in projects and are looking for investments.
As an intermediary, your platform brings borrowers and lenders together.
Your database is the history of your platform.
Additional assumptions about the business model
All projects are completed projects, i.e. the time to raise money for these projects has expired. Your business model provides for the collected funds to be paid out even if the target amount has not been reached.
You earn your money with a commission for every project that lands on your platform.
The backer receives interest for lending the money.
- funded_amount ... Amount received/disbursed at the end of the crowding period in USD
- loan_amount ... Target amount (amount you wanted to achieve for the project) in USD
- activity ... Subcategory to which the goal of the crowd project thematically belongs
- sector ... Top category in which the crowd project topic falls
- use ... Brief description of what the money is to be used for
- country_code ... Country code according to ISO standard
- country ... Country name according to ISO standard
- region ... region
- currency ... Currency in which the funded_amount was then paid out
- term in months ... Duration over which the loan is to be disbursed
- lender_count ...lender (i.e. how many people have given money for the project)
- borrower_genders ... Gender and number of borrowers, i.e. those who initiated the crowd project
- repayment interval ... Contractually agreed repayment modalities/rhythm
# load needed libraries
import pandas as pd
import plotly_express as px
# load data as a test to identify the correct separator
df_test = pd.read_csv('data_abschlussprojekt.csv',
sep='/n',
engine='python',
nrows=2
)
df_test
| # funded_amount# loan_amount# activity# sector# use# country_code# country# region# currency# term_in_months# lender_count# borrower_genders# repayment_interval | |
|---|---|
| 0 | 0#300.0#300.0#Fruits & Vegetables#Food#To buy ... |
| 1 | 1#575.0#575.0#Rickshaw#Transportation#to repai... |
# load data with the correct separator #
df_kiva = pd.read_csv('data_abschlussprojekt.csv',
sep='#',
engine='python',
index_col=0,
skipinitialspace=True
)
df_kiva
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | female | monthly |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 671200 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | female | monthly |
| 671201 | 25.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 1 | female | monthly |
| 671202 | 0.0 | 25.0 | Games | Entertainment | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
| 671203 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
| 671204 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
671205 rows × 13 columns
Preliminary considerations on the data types and possible optimizations for later
- funded_amount --> float
- loan_amount --> float
- activity --> str --> cat?
- sector --> str --> cat?
- use --> str
- country_code --> str --> cat or delete?
- country --> str --> cat?
- region --> str
- currency --> str --> cat?
- term in months --> float
- lender_count --> int
- borrower_genders --> str
- repayment interval --> str
--> later split the column borrower_genders into 2 columns --> borrower_female_count and borrower_male_count --> int
# show data types
df_kiva.dtypes
funded_amount float64 loan_amount float64 activity object sector object use object country_code object country object region object currency object term_in_months float64 lender_count int64 borrower_genders object repayment_interval object dtype: object
Evaluation of data types
The data types are as expected and consistent at first glance.
# show column names
df_kiva.columns
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'use',
'country_code', 'country', 'region', 'currency', 'term_in_months',
'lender_count', 'borrower_genders', 'repayment_interval'],
dtype='object')
# show general information about the DataFrame
df_kiva.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 671205 entries, 0 to 671204 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 funded_amount 671205 non-null float64 1 loan_amount 671205 non-null float64 2 activity 671205 non-null object 3 sector 671205 non-null object 4 use 666972 non-null object 5 country_code 671197 non-null object 6 country 671205 non-null object 7 region 614405 non-null object 8 currency 671205 non-null object 9 term_in_months 671205 non-null float64 10 lender_count 671205 non-null int64 11 borrower_genders 666984 non-null object 12 repayment_interval 671205 non-null object dtypes: float64(3), int64(1), object(9) memory usage: 71.7+ MB
Evaluation
# first statistical evaluation
df_kiva.describe()
| funded_amount | loan_amount | term_in_months | lender_count | |
|---|---|---|---|---|
| count | 671205.000000 | 671205.000000 | 671205.000000 | 671205.000000 |
| mean | 785.995061 | 842.397107 | 13.739022 | 20.590922 |
| std | 1130.398941 | 1198.660073 | 8.598919 | 28.459551 |
| min | 0.000000 | 25.000000 | 1.000000 | 0.000000 |
| 25% | 250.000000 | 275.000000 | 8.000000 | 7.000000 |
| 50% | 450.000000 | 500.000000 | 13.000000 | 13.000000 |
| 75% | 900.000000 | 1000.000000 | 14.000000 | 24.000000 |
| max | 100000.000000 | 100000.000000 | 158.000000 | 2986.000000 |
Evaluation
The max values for funded_amount and loan_amount are the same and are both outside the value ranges. They will be examined in more detail later.
The max values of the other two columns term_in_months and lender_count are also outside their value ranges and will be examined in more detail later.
The other values appear to be plausible so far.
Definition of a duplicate
As data records can be very similar to a large extent due to the many columns with only a few unique values and no distinct values such as date and time, all columns will be used to detect a duplicate.
Identification
# identification of duplicates - all
df_kiva.loc[df_kiva.duplicated(subset=df_kiva,
keep=False)].sort_values(ascending=True, by=['funded_amount', 'loan_amount', 'use'])
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 671167 | 0.0 | 25.0 | Livestock | Agriculture | Kiva Coordinator fixed issue loan (no longer v... | KE | Kenya | NaN | KES | 13.0 | 0 | female, female | monthly |
| 671169 | 0.0 | 25.0 | Livestock | Agriculture | Kiva Coordinator fixed issue loan (no longer v... | KE | Kenya | NaN | KES | 13.0 | 0 | female, female | monthly |
| 671194 | 0.0 | 25.0 | Livestock | Agriculture | Kiva Coordinator fixed issue loan (no longer v... | KE | Kenya | NaN | KES | 13.0 | 0 | female, female | monthly |
| 671168 | 0.0 | 25.0 | Livestock | Agriculture | Pretend the issue with loan got addressed by K... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
| 671172 | 0.0 | 25.0 | Livestock | Agriculture | Pretend the issue with loan got addressed by K... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 542600 | 3600.0 | 3600.0 | Poultry | Agriculture | To buy feed for chicken | TH | Thailand | Chiang Rai | THB | 14.0 | 133 | male | bullet |
| 542790 | 3600.0 | 3600.0 | Poultry | Agriculture | To buy feed for chicken | TH | Thailand | Chiang Rai | THB | 14.0 | 133 | male | bullet |
| 542887 | 3600.0 | 3600.0 | Poultry | Agriculture | To buy feed for chicken | TH | Thailand | Chiang Rai | THB | 14.0 | 131 | female | bullet |
| 542922 | 3600.0 | 3600.0 | Poultry | Agriculture | To buy feed for chicken | TH | Thailand | Chiang Rai | THB | 14.0 | 139 | male | bullet |
| 543030 | 3600.0 | 3600.0 | Poultry | Agriculture | To buy feed for chicken | TH | Thailand | Chiang Rai | THB | 14.0 | 139 | male | bullet |
34930 rows × 13 columns
# display only the values that are duplicated and should be deleted if necessary
df_kiva.loc[df_kiva.duplicated(subset=df_kiva,
keep='first')].sort_values(ascending=True, by=['funded_amount', 'loan_amount', 'use'])
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 671169 | 0.0 | 25.0 | Livestock | Agriculture | Kiva Coordinator fixed issue loan (no longer v... | KE | Kenya | NaN | KES | 13.0 | 0 | female, female | monthly |
| 671194 | 0.0 | 25.0 | Livestock | Agriculture | Kiva Coordinator fixed issue loan (no longer v... | KE | Kenya | NaN | KES | 13.0 | 0 | female, female | monthly |
| 671172 | 0.0 | 25.0 | Livestock | Agriculture | Pretend the issue with loan got addressed by K... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
| 671197 | 0.0 | 25.0 | Livestock | Agriculture | Pretend the issue with loan got addressed by K... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
| 671199 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | female | monthly |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 499778 | 3600.0 | 3600.0 | Poultry | Agriculture | To buy feed for chicken | TH | Thailand | Chiang Rai | THB | 11.0 | 131 | female | bullet |
| 502232 | 3600.0 | 3600.0 | Poultry | Agriculture | To buy feed for chicken | TH | Thailand | Chiang Rai | THB | 11.0 | 130 | male | bullet |
| 542790 | 3600.0 | 3600.0 | Poultry | Agriculture | To buy feed for chicken | TH | Thailand | Chiang Rai | THB | 14.0 | 133 | male | bullet |
| 542887 | 3600.0 | 3600.0 | Poultry | Agriculture | To buy feed for chicken | TH | Thailand | Chiang Rai | THB | 14.0 | 131 | female | bullet |
| 543030 | 3600.0 | 3600.0 | Poultry | Agriculture | To buy feed for chicken | TH | Thailand | Chiang Rai | THB | 14.0 | 139 | male | bullet |
24588 rows × 13 columns
# percentage of duplicates
# number of lines in the entire DataFrame
count_rows = df_kiva.shape[0]
print(count_rows)
# number of duplicates
count_rows_duplicated = df_kiva.loc[df_kiva.duplicated(subset=df_kiva, keep='first'), 'funded_amount'].count()
print(count_rows_duplicated)
p = round((count_rows_duplicated/count_rows)*100, 2)
print('Prozentsatz:', p, '%')
671205 24588 Prozentsatz: 3.66 %
# display of eye-catching description texts for three examples
print(df_kiva.iloc[671169][4])
print(df_kiva.iloc[671172][4])
print(df_kiva.iloc[671199][4])
Kiva Coordinator fixed issue loan (no longer vague). Pretend the issue with loan got addressed by Kiva Coordinator. [True, u'para compara: cemento, arenya y ladriollo para construir una pila.'] - this loan use has been approved by VIVA QA
Handling
There are 34930 data records, each of which occurs at least twice. 24588 rows of these are pure duplicates and should be deleted if necessary - that is approx. 3.66 %.
As all columns were used to identify the duplicates, these data records are clearly to be regarded as duplicates. Theoretically, a data record corresponding to a crowd-investing project can presumably appear several times by the same person for the same purpose for the same amount of money. However, the relationship between the funded_amount, loan_amount and lender_count columns in particular will most likely be different each time.
The above three examples of description texts also suggest that these examples are edited or commented data records by Kiva employees. It is therefore possible that several versions of a project exist as different datasets due to certain edits, which could explain the number of duplicates found.
As only part of the data is available, a decision can currently only be made on the basis of this data.
Based on the above considerations, the duplicates will be deleted.
# deleting the duplicates
df_kiva.drop_duplicates(subset=df_kiva, inplace=True)
# reset index
df_kiva.reset_index(drop=True, inplace=True)
df_kiva
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | female | monthly |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 646612 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | female | monthly |
| 646613 | 0.0 | 25.0 | Livestock | Agriculture | Reviewed loan use in english. | PK | Pakistan | Attock | PKR | 13.0 | 0 | female | monthly |
| 646614 | 0.0 | 125.0 | Livestock | Agriculture | Pretend the flagged issue was addressed by KC. | MX | Mexico | Iztacalco | MXN | 13.0 | 0 | female, female | monthly |
| 646615 | 0.0 | 875.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | female, female | monthly |
| 646616 | 0.0 | 250.0 | Livestock | Agriculture | Reviewed loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 0 | female | monthly |
646617 rows × 13 columns
# check whether all duplicates have been removed
df_kiva.loc[df_kiva.duplicated(subset=df_kiva, keep='first')]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval |
|---|
The following columns with missing values have already been identified:
Identification
# Nullwerte identifizieren
df_kiva.isna().sum().sort_values(ascending=False)
region 56158 use 3900 borrower_genders 3888 country_code 8 funded_amount 0 loan_amount 0 activity 0 sector 0 country 0 currency 0 term_in_months 0 lender_count 0 repayment_interval 0 dtype: int64
# missing values in percentages
df_kiva.isna().mean().round(6).sort_values(ascending=False)*100
region 8.6849 use 0.6031 borrower_genders 0.6013 country_code 0.0012 funded_amount 0.0000 loan_amount 0.0000 activity 0.0000 sector 0.0000 country 0.0000 currency 0.0000 term_in_months 0.0000 lender_count 0.0000 repayment_interval 0.0000 dtype: float64
# find placeholders - display the number of unique values in each column
df_kiva.nunique()
funded_amount 610 loan_amount 479 activity 163 sector 15 use 423452 country_code 86 country 87 region 12695 currency 67 term_in_months 148 lender_count 503 borrower_genders 11298 repayment_interval 4 dtype: int64
df_kiva.loc[:, 'activity'].unique()
array(['Fruits & Vegetables', 'Rickshaw', 'Transportation', 'Embroidery',
'Milk Sales', 'Services', 'Dairy', 'Beauty Salon', 'Manufacturing',
'Food Production/Sales', 'Wholesale', 'General Store',
'Clothing Sales', 'Poultry', 'Tailoring', 'Sewing', 'Bakery',
'Restaurant', 'Food Stall', 'Farming', 'Construction Supplies',
'Personal Products Sales', 'Home Products Sales',
'Natural Medicines', 'Fish Selling', 'Education provider',
'Shoe Sales', 'Machinery Rental', 'Butcher Shop', 'Pigs',
'Personal Expenses', 'Food Market', 'Cosmetics Sales',
'Personal Housing Expenses', 'Retail', 'Energy', 'Grocery Store',
'Construction', 'Agriculture', 'Motorcycle Transport',
'Charcoal Sales', 'Food', 'Pharmacy', 'Fishing', 'Timber Sales',
'Cattle', 'Electronics Repair', 'Electronics Sales', 'Vehicle',
'Cafe', 'Blacksmith', 'Higher education costs', 'Used Clothing',
'Fuel/Firewood', 'Upholstery', 'Catering', 'Animal Sales',
'Cereals', 'Vehicle Repairs', 'Arts',
'Cloth & Dressmaking Supplies', 'Mobile Phones', 'Spare Parts',
'Clothing', 'Metal Shop', 'Barber Shop', 'Furniture Making',
'Crafts', 'Home Energy', 'Home Appliances', 'Wedding Expenses',
'Taxi', 'Secretarial Services', 'Livestock', 'Property',
'Recycling', 'Farm Supplies', 'Auto Repair', 'Beverages',
'Plastics Sales', 'Electrical Goods', 'Carpentry', 'Photography',
'Jewelry', 'Bricks', 'Pub', 'Phone Use Sales',
'Water Distribution', 'Paper Sales', 'Computers',
'Liquor Store / Off-License', 'Utilities', 'Knitting', 'Weaving',
'Party Supplies', 'Medical Clinic', 'Internet Cafe',
'Consumer Goods', 'Cement', 'Electrician',
'Primary/secondary school costs', 'Veterinary Sales',
'Land Rental', 'Laundry', 'Call Center', 'Perfumes', 'Hotel',
'Motorcycle Repair', 'Movie Tapes & DVDs', 'Quarrying',
'Personal Medical Expenses', 'Bookstore', 'Decorations Sales',
'Recycled Materials', 'Office Supplies', 'Souvenir Sales',
'Renewable Energy Products', 'Health', 'Printing', 'Phone Repair',
'Traveling Sales', 'Flowers', 'Bicycle Repair', 'Entertainment',
'Phone Accessories', 'Hardware', 'Used Shoes',
'Music Discs & Tapes', 'Games', 'Balut-Making', 'Textiles',
'Child Care', 'Goods Distribution', 'Florist', 'Cobbler', 'Dental',
'Bookbinding', 'Cheese Making', 'Bicycle Sales', 'Well digging',
'Technology', 'Musical Performance', 'Waste Management', 'Film',
'Tourism', 'Musical Instruments', 'Religious Articles',
'Machine Shop', 'Cleaning Services', 'Sporting Good Sales',
'Patchwork', 'Funerals', 'Air Conditioning', 'Communications',
'Adult Care', 'Landscaping / Gardening', 'Aquaculture',
'Beekeeping', 'Event Planning', 'Celebrations', 'Computer',
'Personal Care Products', 'Mobile Transactions'], dtype=object)
df_kiva.loc[:, 'sector'].unique()
array(['Food', 'Transportation', 'Arts', 'Services', 'Agriculture',
'Manufacturing', 'Wholesale', 'Retail', 'Clothing', 'Construction',
'Health', 'Education', 'Personal Use', 'Housing', 'Entertainment'],
dtype=object)
df_kiva.loc[:, 'country_code'].unique()
array(['PK', 'IN', 'KE', 'NI', 'SV', 'TZ', 'PH', 'PE', 'SN', 'KH', 'LR',
'VN', 'IQ', 'HN', 'PS', 'MN', 'US', 'ML', 'CO', 'TJ', 'GT', 'EC',
'BO', 'YE', 'GH', 'SL', 'HT', 'CL', 'JO', 'UG', 'BI', 'BF', 'TL',
'ID', 'GE', 'UA', 'XK', 'AL', 'CD', 'CR', 'SO', 'ZW', 'CM', 'TR',
'AZ', 'DO', 'BR', 'MX', 'KG', 'AM', 'PY', 'LB', 'WS', 'IL', 'RW',
'ZM', 'NP', 'CG', 'MZ', 'ZA', 'TG', 'BJ', 'BZ', 'SR', 'TH', 'NG',
'MR', 'VU', 'PA', 'VI', 'VC', 'LA', 'MW', 'MM', 'MD', 'SS', 'SB',
'CN', 'EG', 'GU', 'AF', 'MG', nan, 'PR', 'LS', 'CI', 'BT'],
dtype=object)
df_kiva.loc[:, 'country'].unique()
array(['Pakistan', 'India', 'Kenya', 'Nicaragua', 'El Salvador',
'Tanzania', 'Philippines', 'Peru', 'Senegal', 'Cambodia',
'Liberia', 'Vietnam', 'Iraq', 'Honduras', 'Palestine', 'Mongolia',
'United States', 'Mali', 'Colombia', 'Tajikistan', 'Guatemala',
'Ecuador', 'Bolivia', 'Yemen', 'Ghana', 'Sierra Leone', 'Haiti',
'Chile', 'Jordan', 'Uganda', 'Burundi', 'Burkina Faso',
'Timor-Leste', 'Indonesia', 'Georgia', 'Ukraine', 'Kosovo',
'Albania', 'The Democratic Republic of the Congo', 'Costa Rica',
'Somalia', 'Zimbabwe', 'Cameroon', 'Turkey', 'Azerbaijan',
'Dominican Republic', 'Brazil', 'Mexico', 'Kyrgyzstan', 'Armenia',
'Paraguay', 'Lebanon', 'Samoa', 'Israel', 'Rwanda', 'Zambia',
'Nepal', 'Congo', 'Mozambique', 'South Africa', 'Togo', 'Benin',
'Belize', 'Suriname', 'Thailand', 'Nigeria', 'Mauritania',
'Vanuatu', 'Panama', 'Virgin Islands',
'Saint Vincent and the Grenadines',
"Lao People's Democratic Republic", 'Malawi', 'Myanmar (Burma)',
'Moldova', 'South Sudan', 'Solomon Islands', 'China', 'Egypt',
'Guam', 'Afghanistan', 'Madagascar', 'Namibia', 'Puerto Rico',
'Lesotho', "Cote D'Ivoire", 'Bhutan'], dtype=object)
df_kiva.loc[:, 'currency'].unique()
array(['PKR', 'INR', 'KES', 'NIO', 'USD', 'TZS', 'PHP', 'PEN', 'XOF',
'LRD', 'VND', 'HNL', 'MNT', 'COP', 'GTQ', 'TJS', 'BOB', 'YER',
'KHR', 'GHS', 'SLL', 'HTG', 'CLP', 'JOD', 'UGX', 'BIF', 'IDR',
'GEL', 'UAH', 'EUR', 'ALL', 'CRC', 'XAF', 'TRY', 'AZN', 'DOP',
'BRL', 'MXN', 'KGS', 'AMD', 'PYG', 'LBP', 'WST', 'ILS', 'RWF',
'ZMW', 'NPR', 'MZN', 'ZAR', 'BZD', 'SRD', 'NGN', 'VUV', 'XCD',
'MWK', 'LAK', 'MMK', 'ZWD', 'MDL', 'SSP', 'SBD', 'CNY', 'EGP',
'MGA', 'NAD', 'LSL', 'THB'], dtype=object)
df_kiva.loc[:, 'term_in_months'].sort_values(ascending=True).unique()
array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.,
12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22.,
23., 24., 25., 26., 27., 28., 29., 30., 31., 32., 33.,
34., 35., 36., 37., 38., 39., 40., 41., 42., 43., 44.,
45., 46., 47., 48., 49., 50., 51., 52., 53., 54., 55.,
56., 57., 58., 59., 60., 61., 62., 63., 64., 65., 66.,
67., 68., 69., 70., 71., 72., 73., 74., 75., 76., 77.,
78., 79., 80., 81., 82., 83., 84., 85., 86., 87., 88.,
89., 90., 91., 92., 93., 94., 95., 96., 97., 98., 99.,
100., 101., 102., 103., 104., 105., 106., 107., 108., 109., 110.,
111., 112., 113., 114., 115., 116., 118., 120., 121., 122., 123.,
124., 125., 126., 127., 128., 129., 130., 131., 132., 133., 134.,
135., 136., 137., 138., 139., 141., 142., 143., 144., 145., 146.,
147., 148., 154., 156., 158.])
None of these values has a decimal place, the data type of the column can be converted to integer later on.
df_kiva.loc[:, 'repayment_interval'].unique()
array(['irregular', 'bullet', 'monthly', 'weekly'], dtype=object)
Handling
country_code
The unique values in the column country_code and country show that the missing value corresponds to the country 'Namibia'. As the content of the country_code column matches that of the country column and therefore has no added value for the following analyses, the country_code column will be deleted. Filling in the values would be unnecessary and time-consuming.
# deleting the column country_code
df_kiva.drop(columns='country_code', inplace=True)
# check that the column was deleted
df_kiva.columns
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'use', 'country',
'region', 'currency', 'term_in_months', 'lender_count',
'borrower_genders', 'repayment_interval'],
dtype='object')
use
As the use column contains a short description (short text) and only explains the information in the sector and activity columns in more detail, it will be deleted as well.
However, this column could be interesting if it is examined for patterns using text mining. In case this becomes relevant in the subsequent analysis, a new DataFrame will be created so that the old one including this column is retained.
# deleting the column use
df_kiva2 = df_kiva.drop(columns='use')
# check that the column was deleted
df_kiva2.columns
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'country',
'region', 'currency', 'term_in_months', 'lender_count',
'borrower_genders', 'repayment_interval'],
dtype='object')
borrower_genders
This column contains data that will cause problems in subsequent analyses or at the latest for the EDA. For example, one cell contains 'female, female', so it contains the gender of each person who is a borrower. In fact, this is basically a listing.
It is assumed that the unique values are female and male, which are now split into two columns, each containing the number of people as an integer:
If there are further values, a provisional column (borrower_rest_count) is created for this, which is checked at the end and then deleted.
On this basis, the system checks what the respective unique values are and how they are distributed. The final handling is decided on the basis of this.
# Create new columns and fill them with the number zero
df_kiva2.loc[:, 'borrower_female_count'] = 0
df_kiva2.loc[:, 'borrower_male_count'] = 0
df_kiva2.loc[:, 'borrower_rest_count'] = 0
df_kiva2.head()
| funded_amount | loan_amount | activity | sector | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | borrower_female_count | borrower_male_count | borrower_rest_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular | 0 | 0 | 0 |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular | 0 | 0 | 0 |
| 2 | 150.0 | 150.0 | Transportation | Transportation | India | Maynaguri | INR | 43.0 | 6 | female | bullet | 0 | 0 | 0 |
| 3 | 200.0 | 200.0 | Embroidery | Arts | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular | 0 | 0 | 0 |
| 4 | 400.0 | 400.0 | Milk Sales | Food | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | female | monthly | 0 | 0 | 0 |
# How do I get my object? - Testing Area
borrowers = df_kiva2.iloc[(1, 9)]
borrowers
'female, female'
# How do I get my individual values? - Testing Area
# separate the values at the comma with a space
gender = borrowers.strip().split(', ')
print(gender) # The result is a list with strings
# single value
gender[0]
['female', 'female']
'female'
# cell has only one value in the example female, maximum length is 6 - Testing Area
len(df_kiva2.iloc[(0, 9)])
6
# Write number of values in new column - Testing Area
# for only one value
# df_kiva2.iloc[(0, 11)] = 1 # if female, for male [12]
df_kiva2.head()
# For multiple values, i.e. a list
list = df_kiva2.iloc[(1, 9)].strip().split(', ') # Specify the length of the list, so count the objects in the list
list.count('female') # Number of the string female within the list
# for missing values
# df_kiva2.iloc[(1, 11)] = 0
# df_kiva2.head()
pd.isna(df_kiva2.iloc[(140, 9)]) # example for np.nan in row 140
True
# if the value is not a string - Testing Area
type(df_kiva2.iloc[(0, 9)]) != str
False
# Checking for np.nan via isna() does not seem to work, so the values are replaced by -1
#df_kiva2.fillna({'borrower_genders': -1}, inplace=True)
#df_kiva2.iloc[140, 9] == -1
#type(df_kiva2.iloc[140, 9]) == int
type(df_kiva2.iloc[140, 9]) == float
pd.isna(df_kiva2.iloc[140, 9])
True
# for-loop to split the column borrower_genders into new columns
# Go through all rows in the column with index 9
rest = {} # if there are values that are neither string nor 'nan', they are written further down into a dict
for index in range(len(df_kiva2)):
# if it is a float
if type(df_kiva2.iloc[(index, 9)]) == float:
# if it is a np.nan, write the same into the columns borrower_male_count and borrower_female_count
if pd.isna(df_kiva2.iloc[(index, 9)]):
df_kiva2.iloc[(index, 11)] = df_kiva2.iloc[(index, 9)]
df_kiva2.iloc[(index, 12)] = df_kiva2.iloc[(index, 9)]
# if it is a float (and not a np.nan), write the value into the column borrower_rest_count
else:
df_kiva2.iloc[(index, 13)] = df_kiva2.iloc[(index, 9)]
# and put the index as a string together with the value into the dictionary rest
rest[str(index)] = df_kiva2.iloc[(index, 9)]
# if it is a string
if type(df_kiva2.iloc[(index, 9)]) == str:
# if the length is 4, write 1 into column borrower_male_count
if len(df_kiva2.iloc[(index, 9)]) == 4:
df_kiva2.iloc[(index, 12)] = 1
# if the length is 6, write 1 into column borrower_female_count
if len(df_kiva2.iloc[(index, 9)]) == 6:
df_kiva2.iloc[(index, 11)] = 1
# otherwise separate the values at comma with a space and save this in the variable gender_list
else:
gender_list = df_kiva2.iloc[(index, 9)].strip().split(', ')
# calculate the number of male and female and write the number in the appropriate column
count_male = gender_list.count('male')
df_kiva2.iloc[(index, 12)] = count_male
count_female = gender_list.count('female')
df_kiva2.iloc[(index, 11)] = count_female
# otherwise put the value into column borrower_rest_count
#else:
# df_kiva2.iloc[(index, 13)] = df_kiva2.iloc[(index, 9)]
# and put the index as a string together with the value into the dictionary rest
# rest[str(index)] = df_kiva2.iloc[(index, 9)]
print('Done!')
Done!
# Check if everything worked
print(rest)
print(df_kiva2.iloc[140, 9]) # check whether nan value has been transferred correctly
df_kiva2.tail()
{}
nan
| funded_amount | loan_amount | activity | sector | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | borrower_female_count | borrower_male_count | borrower_rest_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 646612 | 0.0 | 25.0 | Livestock | Agriculture | Paraguay | Concepción | USD | 13.0 | 0 | female | monthly | 1.0 | 0.0 | 0 |
| 646613 | 0.0 | 25.0 | Livestock | Agriculture | Pakistan | Attock | PKR | 13.0 | 0 | female | monthly | 1.0 | 0.0 | 0 |
| 646614 | 0.0 | 125.0 | Livestock | Agriculture | Mexico | Iztacalco | MXN | 13.0 | 0 | female, female | monthly | 2.0 | 0.0 | 0 |
| 646615 | 0.0 | 875.0 | Livestock | Agriculture | Bolivia | La Paz | BOB | 13.0 | 0 | female, female | monthly | 2.0 | 0.0 | 0 |
| 646616 | 0.0 | 250.0 | Livestock | Agriculture | Ghana | Dansoman | GHS | 13.0 | 0 | female | monthly | 1.0 | 0.0 | 0 |
The dictionary rest and therefore also the column borrower_rest_count are empty, so there were no unexpected values.
# check unique values of female borrowers
df_kiva2.loc[:, 'borrower_female_count'].unique()
array([ 1., 2., 3., 8., 0., 9., 5., 4., 13., 12., 7., nan, 10.,
6., 21., 18., 30., 11., 27., 31., 34., 23., 22., 15., 17., 32.,
36., 19., 14., 25., 20., 33., 40., 24., 26., 39., 28., 16., 35.,
38., 29., 37., 45., 44., 50., 43., 48., 41., 42., 49.])
# check unique values of male borrowers
df_kiva2.loc[:, 'borrower_male_count'].unique()
array([ 0., 1., 5., 2., 3., nan, 8., 4., 10., 7., 9., 6., 25.,
13., 12., 11., 18., 16., 21., 24., 15., 19., 23., 31., 17., 26.,
14., 22., 20., 28., 29., 33., 36., 35., 32., 34., 40., 27., 44.,
39., 30.])
# check unique values of the column with the rest
df_kiva2.loc[:, 'borrower_rest_count'].unique()
array([0], dtype=int64)
The dictionary rest and therefore also the borrower_rest_count column are empty, so there were no unexpected values. The column and the borrower_genders column, which is no longer required, will be deleted below.
# deleting the columns borrower_rest_count and borrower_genders
df_kiva2.drop(columns=['borrower_genders', 'borrower_rest_count'], inplace=True)
df_kiva2.columns
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'country',
'region', 'currency', 'term_in_months', 'lender_count',
'repayment_interval', 'borrower_female_count', 'borrower_male_count'],
dtype='object')
# count of each unique value in borrower_male_count
df_kiva2.loc[:, 'borrower_male_count'].value_counts()
0.0 470903 1.0 144126 2.0 8130 3.0 5567 4.0 3830 5.0 3077 6.0 2176 7.0 1514 8.0 1115 9.0 719 10.0 509 11.0 308 12.0 192 13.0 137 14.0 89 15.0 62 16.0 49 17.0 42 18.0 34 19.0 28 20.0 21 22.0 19 21.0 13 25.0 13 23.0 12 24.0 8 27.0 6 26.0 6 28.0 5 36.0 3 33.0 3 29.0 3 31.0 3 35.0 1 32.0 1 34.0 1 40.0 1 44.0 1 39.0 1 30.0 1 Name: borrower_male_count, dtype: int64
# count of each unique value in borrower_female_count
df_kiva2.loc[:, 'borrower_female_count'].value_counts()
1.0 420065 0.0 133307 3.0 15054 2.0 15047 4.0 12168 5.0 9420 7.0 5789 6.0 5581 8.0 4370 10.0 3359 9.0 3074 17.0 1888 11.0 1607 15.0 1511 18.0 1205 12.0 1181 16.0 1079 19.0 925 13.0 860 20.0 858 14.0 723 21.0 590 22.0 490 23.0 440 25.0 355 24.0 349 26.0 224 27.0 201 28.0 172 30.0 157 29.0 117 31.0 102 33.0 75 32.0 71 34.0 55 37.0 52 35.0 51 36.0 45 38.0 26 40.0 18 39.0 16 42.0 15 41.0 11 43.0 9 44.0 5 48.0 5 45.0 4 50.0 2 49.0 1 Name: borrower_female_count, dtype: int64
# sum of current missing values
df_kiva2.isna().sum()
funded_amount 0 loan_amount 0 activity 0 sector 0 country 0 region 56158 currency 0 term_in_months 0 lender_count 0 repayment_interval 0 borrower_female_count 3888 borrower_male_count 3888 dtype: int64
How to further deal with missing values:
In both columns, the most common values by far are 1 and 0. In order to be able to continue working with this data, the missing values will be filled with the respective mean value in dependence on another column.
The sector column will be used for this, as it describes the content of the project and it is to be expected that gender is related to this. In addition, the column has relatively few unique values compared to other columns, so that the filling will be relatively quick.
# Group the two columns with the number of genders according to the sector column and calculate the mean value in each case
# round the result
gender_group = df_kiva2.groupby(by=['sector'],
as_index=False,
observed=True).agg({'borrower_female_count': 'mean', 'borrower_male_count': 'mean'}).round()
gender_group
| sector | borrower_female_count | borrower_male_count | |
|---|---|---|---|
| 0 | Agriculture | 1.0 | 1.0 |
| 1 | Arts | 2.0 | 0.0 |
| 2 | Clothing | 2.0 | 0.0 |
| 3 | Construction | 1.0 | 1.0 |
| 4 | Education | 1.0 | 0.0 |
| 5 | Entertainment | 1.0 | 1.0 |
| 6 | Food | 2.0 | 0.0 |
| 7 | Health | 1.0 | 0.0 |
| 8 | Housing | 1.0 | 0.0 |
| 9 | Manufacturing | 1.0 | 0.0 |
| 10 | Personal Use | 2.0 | 1.0 |
| 11 | Retail | 2.0 | 0.0 |
| 12 | Services | 1.0 | 0.0 |
| 13 | Transportation | 1.0 | 1.0 |
| 14 | Wholesale | 1.0 | 1.0 |
# testing area
test = (gender_group.loc[gender_group.loc[:, 'sector'] == 'Food', 'borrower_female_count'].item()) # should be 2.0
test
type(test) # should be a float
float
# fill the nan values depending on the column sector - borrower_female_count
for sector in df_kiva2.loc[df_kiva2.loc[:, 'borrower_female_count'].isna(), 'sector'].unique():
female_count = (gender_group.loc[gender_group.loc[:, 'sector'] == sector, 'borrower_female_count'].item())
df_kiva2.loc[(df_kiva2.loc[:, 'sector'] == sector) & (df_kiva2.loc[:, 'borrower_female_count'].isna()), 'borrower_female_count'] = female_count
df_kiva2.isna().sum()
funded_amount 0 loan_amount 0 activity 0 sector 0 country 0 region 56158 currency 0 term_in_months 0 lender_count 0 repayment_interval 0 borrower_female_count 0 borrower_male_count 3888 dtype: int64
# fill the nan values depending on the column sector - borrower_male_count
for sector in df_kiva2.loc[df_kiva2.loc[:, 'borrower_male_count'].isna(), 'sector'].unique():
male_count = (gender_group.loc[gender_group.loc[:, 'sector'] == sector, 'borrower_male_count'].item())
df_kiva2.loc[(df_kiva2.loc[:, 'sector'] == sector) & (df_kiva2.loc[:, 'borrower_male_count'].isna()), 'borrower_male_count'] = male_count
df_kiva2.isna().sum()
funded_amount 0 loan_amount 0 activity 0 sector 0 country 0 region 56158 currency 0 term_in_months 0 lender_count 0 repayment_interval 0 borrower_female_count 0 borrower_male_count 0 dtype: int64
# Outdated code!!!
# filling the nan values with the respective mean value
#f_mean = df_kiva2.loc[:, 'borrower_female_count'].mean().round()
#m_mean = df_kiva2.loc[:, 'borrower_male_count'].mean().round()
#df_kiva2.loc[:, 'borrower_female_count'] = df_kiva2.loc[:, 'borrower_female_count'].fillna(value=f_mean)
#df_kiva2.loc[:, 'borrower_male_count'] = df_kiva2.loc[:, 'borrower_male_count'].fillna(value=m_mean)
# check
#df_kiva2.isna().sum()
region
This column is missing the most values with approx. 8.7 %. As the region is dependent on the country and there is therefore a large number of unique values, it would take too much time to fill in the missing values in dependency. In addition, according to the information from the data dictionary, it is not clear whether this could also involve several regions.
As the data is still to be used, the missing region will be filled in with the country due to its dependency on it.
# exact number of unique values in the region column
df_kiva2.loc[:, 'region'].nunique()
12695
# testing area
df_kiva2.loc[df_kiva2.loc[:, 'region'].isna(), 'country'].unique()
#country = 'Kenya'
#df_kiva2.loc[(df_kiva2.loc[:, 'country'] == country) & (df_kiva2.loc[:, 'region'].isna()), 'region'] = 'Kenya'
#df_kiva2.loc[df_kiva2.loc[:, 'region'] == 'Kenya']
# Are there already data sets where country equals region?
df_kiva2.loc[df_kiva2.loc[:, 'region'] == df_kiva2.loc[:, 'country']]
| funded_amount | loan_amount | activity | sector | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female_count | borrower_male_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41414 | 1300.0 | 1300.0 | Livestock | Agriculture | Guatemala | Guatemala | GTQ | 10.0 | 47 | monthly | 2.0 | 0.0 |
| 165448 | 1850.0 | 1850.0 | Cattle | Agriculture | Guatemala | Guatemala | GTQ | 14.0 | 27 | monthly | 2.0 | 0.0 |
| 318067 | 1050.0 | 1050.0 | Bakery | Food | Guatemala | Guatemala | GTQ | 8.0 | 40 | monthly | 4.0 | 0.0 |
| 332133 | 525.0 | 525.0 | Bookstore | Retail | Guatemala | Guatemala | GTQ | 6.0 | 20 | irregular | 2.0 | 0.0 |
| 336960 | 400.0 | 400.0 | Grocery Store | Food | Guatemala | Guatemala | GTQ | 8.0 | 16 | monthly | 2.0 | 0.0 |
| 356698 | 1850.0 | 1850.0 | Pharmacy | Health | Guatemala | Guatemala | GTQ | 14.0 | 70 | monthly | 0.0 | 2.0 |
| 494261 | 12700.0 | 12700.0 | Education provider | Education | Guatemala | Guatemala | USD | 25.0 | 409 | irregular | 1.0 | 0.0 |
| 509027 | 10000.0 | 10000.0 | Recycled Materials | Retail | Guatemala | Guatemala | USD | 25.0 | 340 | irregular | 0.0 | 1.0 |
| 534338 | 1875.0 | 1875.0 | Pharmacy | Health | Guatemala | Guatemala | GTQ | 14.0 | 56 | monthly | 0.0 | 2.0 |
| 559627 | 13000.0 | 13000.0 | Services | Services | Guatemala | Guatemala | USD | 25.0 | 380 | irregular | 0.0 | 1.0 |
| 567792 | 1175.0 | 1175.0 | Grocery Store | Food | Guatemala | Guatemala | GTQ | 8.0 | 46 | monthly | 1.0 | 1.0 |
| 569267 | 475.0 | 475.0 | Food Stall | Food | Guatemala | Guatemala | GTQ | 6.0 | 9 | irregular | 2.0 | 0.0 |
| 586750 | 1725.0 | 1725.0 | Clothing Sales | Clothing | Guatemala | Guatemala | GTQ | 8.0 | 30 | monthly | 7.0 | 0.0 |
| 599186 | 550.0 | 550.0 | Services | Services | Guatemala | Guatemala | GTQ | 6.0 | 22 | irregular | 0.0 | 2.0 |
| 623756 | 14300.0 | 14300.0 | Textiles | Arts | Guatemala | Guatemala | USD | 24.0 | 521 | irregular | 0.0 | 1.0 |
There are already 15 data records that have the same value in country and region. This is only the case for Guatemala.
# fill in the missing values with the value from column country
for country in df_kiva2.loc[df_kiva2.loc[:, 'region'].isna(), 'country'].unique():
df_kiva2.loc[(df_kiva2.loc[:, 'country'] == country) & (df_kiva2.loc[:, 'region'].isna()), 'region'] = country
# Check change - are there any missing values?
df_kiva2.loc[df_kiva2.loc[:, 'region'].isna()]
| funded_amount | loan_amount | activity | sector | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female_count | borrower_male_count |
|---|
# Check change - where are region and country the same?
df_kiva2.loc[df_kiva2.loc[:, 'region'] == df_kiva2.loc[:, 'country']]
| funded_amount | loan_amount | activity | sector | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female_count | borrower_male_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 250.0 | 250.0 | Services | Services | Kenya | Kenya | KES | 4.0 | 6 | irregular | 1.0 | 0.0 |
| 49 | 450.0 | 450.0 | General Store | Retail | El Salvador | El Salvador | USD | 14.0 | 18 | monthly | 0.0 | 1.0 |
| 54 | 225.0 | 225.0 | Food Market | Food | Senegal | Senegal | XOF | 14.0 | 7 | monthly | 1.0 | 0.0 |
| 67 | 125.0 | 125.0 | Energy | Services | Kenya | Kenya | KES | 3.0 | 6 | irregular | 0.0 | 1.0 |
| 70 | 2000.0 | 2000.0 | Retail | Retail | Iraq | Iraq | USD | 15.0 | 71 | monthly | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 646606 | 25.0 | 25.0 | Livestock | Agriculture | Kenya | Kenya | KES | 13.0 | 1 | monthly | 1.0 | 0.0 |
| 646607 | 0.0 | 25.0 | Livestock | Agriculture | Kenya | Kenya | KES | 13.0 | 0 | monthly | 2.0 | 0.0 |
| 646608 | 0.0 | 25.0 | Livestock | Agriculture | Kenya | Kenya | KES | 13.0 | 0 | monthly | 1.0 | 0.0 |
| 646609 | 0.0 | 25.0 | Games | Entertainment | Kenya | Kenya | KES | 13.0 | 0 | monthly | 2.0 | 0.0 |
| 646610 | 0.0 | 25.0 | Games | Entertainment | Kenya | Kenya | KES | 13.0 | 0 | monthly | 1.0 | 1.0 |
56173 rows × 12 columns
Already noted above:
The max values for funded_amount and loan_amount are the same and are both outside the value ranges. They will be examined in more detail later.
The max values of the other two columns term_in_months and lender_count are also outside their value ranges and will be examined in more detail later as well.
Identification
# short statistics
df_kiva2.describe()
| funded_amount | loan_amount | term_in_months | lender_count | borrower_female_count | borrower_male_count | |
|---|---|---|---|---|---|---|
| count | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 |
| mean | 808.045242 | 866.458816 | 13.719826 | 21.115360 | 1.613281 | 0.415136 |
| std | 1145.419041 | 1214.276106 | 8.479912 | 28.840674 | 3.077798 | 1.132761 |
| min | 0.000000 | 25.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 275.000000 | 275.000000 | 8.000000 | 7.000000 | 1.000000 | 0.000000 |
| 50% | 475.000000 | 500.000000 | 13.000000 | 13.000000 | 1.000000 | 0.000000 |
| 75% | 925.000000 | 1000.000000 | 14.000000 | 25.000000 | 1.000000 | 1.000000 |
| max | 100000.000000 | 100000.000000 | 158.000000 | 2986.000000 | 50.000000 | 44.000000 |
# short visualization via histogram - funded_amount and loan_amount
hist_funded_loan = px.histogram(data_frame=df_kiva2.loc[:, ['funded_amount', 'loan_amount']],
y=['funded_amount', 'loan_amount'],
title='Distribution of data on target amount and amount actually achieved in USD'
)
hist_funded_loan.show()